We propose the multiple LUT cascade as a means to configure an n-input LPM (Longest Prefix
Match) address generator, commonly used in routers to determine the output port given an address.
The LPM address generator accepts n-bit addresses, which it matches against k stored prefixes.
We implement our design on a Xilinx Spartan-3 FPGA for n = 32 and k = 504~511. Also, we
compare our design to a Xilinx proprietary TCAM (ternary content-addressable memory) design and
to another design we propose as a likely solution to this problem. Our best multiple LUT cascade
implementation has 5.17 times more throughput, 40.71 times more throughput/area, and is 2.97 times
more efficient in terms of area-delay product than Xilinx’s proprietary design, while its area is only
15% of Xilinx’s design. Furthermore, we derive a method to determine the optimum configuration of
the multiple LUT cascade on an FPGA.
1 Introduction

The  need  for  higher  internet  speeds  is  likely  to  be  the  subject  of  intense  interest 
for  many  years  to  come.  A  network’s  speed  is  directly  related  to  the  speed  with 
which  a  node  can  switch  a  packet  from  an  input  port  to  an  output  port.  This, 
in  turn,  depends  on  how  fast  a  packet’s  address  can  be  accessed  in  memory. 
The  longest  prefix  match  (LPM)  problem  is  one  of  determining  the  output 
port  address  from  a  list  of  prefix  vectors  stored  in  memory.  For  example, 
if  the  prefix  vector  01001****  is  stored  in  memory,  then  the  packet  address 
010011111  matches  this  entry.  That  is,  each  bit  in  the  packet  address  matches 
exactly the corresponding bit in the prefix vector or there is a * or don’t care in
that position. If other stored prefixes match the packet address, then the prefix
with the fewest don’t care values determines the output port address. That is,
the  memory  entry  corresponding  to  the  longest  prefix  match  determines  the 
output  port. 
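
The matching rule above can be sketched in a few lines of Python. This is only an illustration of the rule, not of the hardware: the function names, the two extra prefixes, and the port labels are hypothetical; only the prefix 01001**** and the address 010011111 come from the text.

```python
def matches(prefix: str, address: str) -> bool:
    """True if every prefix position equals the address bit or is '*'."""
    return len(prefix) == len(address) and all(
        p == "*" or p == a for p, a in zip(prefix, address)
    )


def longest_prefix_match(table, address):
    """Return the port of the matching prefix with the fewest '*' values, or None."""
    best = None
    for prefix, port in table:
        if matches(prefix, address):
            if best is None or prefix.count("*") < best[0].count("*"):
                best = (prefix, port)
    return None if best is None else best[1]


# Hypothetical table: the first prefix is the one from the text; the other
# prefixes and all port names are made up for illustration.
table = [("01001****", "port A"), ("0100*****", "port B"), ("1********", "port C")]
print(longest_prefix_match(table, "010011111"))
```

Both 01001**** and 0100***** match the address, and the first one wins because it has fewer don’t care positions.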

An  ideal  device  for  this  application  is  a  ternary  content-addressable 
memory  (TCAM)  (Pagiamtzis  et  al.  2006,  Song  et  al.  2005).  The  descriptor 
“ternary” refers to the three values stored, 0, 1, and *. In (Kasnavi et al.
2005), the authors proposed pipelined TCAMs for the longest prefix match to
increase  TCAM  efficiency.  In  (Wang  et  al.  2005),  the  authors  used  a  TCAM 
and  a  small  DRAM  for  the  longest  prefix  match  to  reduce  the  required  size 
of  TCAM.  Unfortunately,  TCAM  still  dissipates  more  power  than  standard 
RAM  (Renesas  2005). 

Several authors have proposed the use of standard RAM in LPM design.
Gupta, Lin, and McKeown showed a mechanism to perform an LPM
every memory access (Gupta et al. 1998). Dharmapurikar, Krishnamurthy,
and Taylor have proposed the use of Bloom filters to solve the LPM
problem (Dharmapurikar et al. 2003). Sasao and Butler have shown
a fast, power-efficient TCAM realization using a look-up table (LUT)
cascade (Sasao et al. 2006).

In  this  paper,  we  propose  an  extension  to  the  LUT  cascade  realization: 
a  multiple  LUT  cascade  realization  that  consists  of  multiple  LUT  cascades 
connected  to  a  special  encoder.  This  offers  even  more  efficient  realizations  in 
an  architecture  that  is  more  easily  reconfigured  when  additional  prefix  vectors 
are  placed  in  the  prefix  table. 

We  have  implemented  six  LPM  address  generators  on  the  Xilinx  Spartan-3 
FPGA  (XC3S4000-5):  Four  using  multiple  LUT  cascades,  one  using  Xilinx’s 
TCAM  realization  based  on  the  Xilinx  IP  core,  and  one  using  registers  and 
gates. In addition, we compare the six LPM address generators on the basis of
delay, delay-area product, throughput, throughput/area, and FPGA resources
used.

A preliminary version of this paper was presented at ARC2006 (Qin et al.
2006). We extend these results by introducing the optimum configuration of
the  multiple  LUT  cascade,  and  by  showing  how  to  realize  the  optimum  multiple 
LUT  cascade  on  an  FPGA. 

The  rest  of  the  paper  is  organized  as  follows:  Section  2  describes  the  multiple 
LUT cascade. Section 3 shows other realizations for the LPM address
generators. Section 4 presents the implementations of the LPM address generator
using an FPGA. Section 5 shows the experimental results. Section 6 discusses
the optimum configuration of the multiple LUT cascade implemented on an
FPGA. Finally, Section 7 concludes the paper.
Table 1. LPM table.


2  Multiple  LUT  cascades 
2.1  LPM  address  generators 

A content-addressable memory (CAM) (Shafai et al. 1998) stores 0’s and 1’s
and produces the address of the given data. A TCAM, unlike a CAM, stores
0’s, 1’s, and *’s, where * is a don’t care value that matches both 0 and 1.

TCAMs are extensively used in routing tables for the internet. A routing
table specifies an interface identifier corresponding to the longest prefix that
matches an incoming packet, in a process called Longest Prefix Match
(LPM). In the LPM table, the ternary vectors have restricted patterns: the
prefix consists of only 0’s and 1’s, and the postfix consists of only *’s (don’t
cares). In this paper, this type of vector is called a prefix vector.

Definition 2.1 An n-input m-output k-entry LPM table stores k n-element
prefix vectors. To assure that the longest prefix address is produced, TCAM
entries are stored in descending prefix length, and the first match starting from
the top of the table determines the LPM table’s output. An address is an
m-element binary vector for $m = \lceil \log_2(k + 1) \rceil$, where $\lceil a \rceil$ denotes the smallest
integer greater than or equal to a. The corresponding LPM function is a
logic function $f : B^n \rightarrow B^m$, where $f(x)$ is the smallest address of an entry
that is identical to x except possibly for don’t care values. If no such entry
exists, $f(x) = 0^m$. The LPM address generator is a circuit that realizes
the LPM function.

Example 2.2 Table 1 shows an LPM table with five 4-element prefix vectors.
Table  2  shows  the  corresponding  LPM  function.  It  has  16  entries,  one  for  each 
4-bit  input.  The  output  address  is  stored  for  each  input  corresponding  to  the 
address  of  the  longest  prefix  vector  that  matches  it.  □ 


Address   Prefix Vector
1         1000
2         010*
3         01^5
4
5

Table  2.  LPM  function. 


Input   Output Address     Input   Output Address
0000    5                  1000    1
0001    5                  1001    4
0010    5                  1010    4
0011    5                  1011    4
0100    2                  1100    4
0101    2                  1101    4
0110    3                  1110    4
0111    3                  1111    4

2.2  An  LUT  cascade  realization  of  LPM  address  generators 

An LPM function, such as that shown in Table 2, can be realized by a single
memory which operates as a programmable combinational logic circuit.
However, this often requires a prohibitively large memory.

Theorem 2.3 (Sasao 2006) An n-input LPM address generator with k prefix
vectors can be realized by an LUT cascade, where each cell realizes a p-input,
r-output combinational logic function. Let s be the necessary number of levels
or cells. Then,

$$ s \le \left\lceil \frac{n - r}{p - r} \right\rceil, \tag{1} $$

where $p > r$ and $r = \lceil \log_2(k + 1) \rceil$.


2.3  LPM  address  generators  using  the  multiple  LUT  cascade 

A  single  LUT  cascade  realization  of  an  LPM  function  often  requires  many 
levels.  Since  the  delay  is  proportional  to  the  number  of  levels  in  a  cascade,  we 
wish  to  reduce  the  number  of  levels.  According  to  (1),  if  we  increase  p,  the 
number  of  inputs  to  each  cell,  then  the  number  of  levels  s  is  reduced.  For  each 
increase  by  1  of  p,  the  memory  needed  to  realize  the  cell  is  doubled.  However, 
as  shown  in  Figure  1,  we  can  use  a  multiple  LUT  cascade  to  reduce  the  number 
of  levels  s  while  keeping  p  fixed.  For  an  n-input  LPM  function  with  k  prefix 
vectors,  let  the  number  of  rails  of  each  LUT  cascade  be  r.  First,  starting  at 
the  top  of  the  LPM  table,  partition  the  set  of  prefix  vectors  into  g  groups  of 
$2^r - 1$ vectors each, except the last group, which has $2^r - 1$ or fewer vectors,
where $g = \lceil k / (2^r - 1) \rceil$. For each group of prefix vectors, form an independent LPM
function.  Next,  partition  the  set  of  n  inputs  into  s  groups.  The  inputs  within 
a  group  will  apply  to  a  single  cell  within  each  cascade.  Then,  realize  each 
LPM  function  by  an  LUT  cascade.  Thus,  we  need  a  total  of  g  LUT  cascades, 
where  each  LUT  cascade  consists  of  s  cells.  Finally,  use  a  special  encoder  to 
produce the LPM address. Let $v_i$ ($i = 1, 2, \ldots, g$) be the i-th input of the special
encoder from the i-th LUT cascade, and let $v_{out}$ be the output value of the
special encoder. That is, $v_i$ is the output value of the i-th LUT cascade, where
its binary output values are viewed as a standard binary number. Similarly,
$v_{out}$ is the output of the special encoder, where its binary output values are
viewed as a standard binary number. Then, we have the relation:

$$ v_{out} = \begin{cases} v_i + (i - 1)(2^r - 1) & \text{if } v_i \neq 0 \text{ and } v_j = 0 \text{ for all } 1 \le j \le i - 1, \\ 0 & \text{if } v_i = 0 \text{ for all } 1 \le i \le g. \end{cases} $$

Note that $v_{out}$ is the position of a prefix vector v in the complete LPM table,
while i is the index to the LUT cascade storing v. $(i - 1)(2^r - 1)$ is the position
in the LPM table of the last entry of the previous (i - 1)-th LUT cascade, or
is 0 in the case of the first LUT cascade. Adding $v_i$ to this yields the position
of v in the complete LPM table.
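
As a behavioral illustration only, the relation for the special encoder can be written as a short Python function. The name `special_encoder` and the list-based interface are assumptions made for this sketch; the hardware realization is a combinational circuit, not a loop.

```python
def special_encoder(v, r):
    """Behavioral model of the special encoder.

    v -- list of cascade outputs [v_1, ..., v_g], each in 0 .. 2**r - 1
    r -- number of rails of each LUT cascade
    """
    for i, v_i in enumerate(v, start=1):
        if v_i != 0:
            # First cascade (from the top) with a match: add the offset of
            # the (i - 1) preceding groups of 2**r - 1 vectors each.
            return v_i + (i - 1) * (2**r - 1)
    return 0  # no cascade reported a match


print(special_encoder([0, 3], r=8))  # second cascade matches its entry 3
```

With r = 8 each group holds 255 vectors, so entry 3 of the second cascade is at position 3 + 255 = 258 in the complete table.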

Example 2.4 Consider an n-input LPM function with k prefix vectors. When
k = 1000 and n = 32, by Theorem 2.3, we have r = 10. Let p = r + 1 = 11.
When we use a single LUT cascade to realize the function, by Theorem 2.3,
we need $\lceil (32 - 10)/(11 - 10) \rceil = 22$ cells, and the number of levels of the LUT cascade is also
22. Since each cell has 11 address lines and 10 outputs, the total memory size
needed to realize the cascade is $2^{11} \times 10 \times 22 = 450{,}560$ bits. Note that the
memory size of each cell, $2^{11} \times 10 = 20{,}480$ bits, is too large to be realized by
a single block RAM (BRAM) of our FPGA, which stores 18,432 bits.

However, if we use a multiple LUT cascade to realize the function, we can
reduce the number of levels and the total memory. Also, the cells will fit
into the BRAMs in the FPGAs. Partition the set of vectors into two groups,
and realize each group independently; this requires two LUT cascades. For
each LUT cascade, the number of vectors is 500, so we have r = 9. Also, let
p = r + 2 = 11. Then, we need $\lceil (32 - 9)/(11 - 9) \rceil = 12$ cells in each cascade. Note that
the number of levels of the LUT cascades is 12, which is smaller than the
22 needed in the single LUT cascade realization. Since each cell consists of a
memory with 9 outputs and at most 11 address lines, the total memory size is
at most $2^{11} \times 9 \times 12 \times 2 = 442{,}368$ bits. Also, note that the size of the memory
for a single cell is $2^{11} \times 9 = 18{,}432$ bits. This fits exactly in the BRAMs of the
FPGAs.

Thus,  the  multiple  LUT  cascade  not  only  reduces  the  number  of  levels  and 
the  total  memory,  but  also  reduces  the  size  of  cells  to  fit  into  the  available 
memory  in  the  FPGAs.  □ 
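
The arithmetic of Example 2.4 can be reproduced with the following sketch. The helper names are hypothetical; the formulas are the bound of Theorem 2.3 with equality and the memory estimate 2^p x r x levels x cascades used in the example. Integer arithmetic is used so that no floating-point rounding is involved.

```python
def rails(k):
    # ceil(log2(k + 1)) equals the bit length of k for k >= 1.
    return k.bit_length()


def levels(n, p, r):
    # ceil((n - r) / (p - r)), the bound of Theorem 2.3 taken with equality.
    return -(-(n - r) // (p - r))


def memory_bits(n, p, r, cascades=1):
    # Upper bound: every cell is counted as a full 2**p-word, r-bit memory.
    return 2**p * r * levels(n, p, r) * cascades


# Single LUT cascade: k = 1000, n = 32, p = r + 1.
r1 = rails(1000)
print(r1, levels(32, r1 + 1, r1), memory_bits(32, r1 + 1, r1))

# Two LUT cascades of 500 vectors each, p = r + 2.
r2 = rails(500)
print(r2, levels(32, r2 + 2, r2), memory_bits(32, r2 + 2, r2, cascades=2))
```

The two printed lines correspond to the single cascade (10 rails, 22 levels, 450,560 bits) and the two-cascade realization (9 rails, 12 levels, 442,368 bits).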

Fig. 1 shows the multiple LUT cascade realization. It consists of multiple
LUT cascades and a special encoder. The inputs of each LUT cascade are
common with other LUT cascades, while the outputs of each LUT cascade are
connected to the special encoder. Each LUT cascade realizes an LPM function,
while the special encoder generates the LPM address from the outputs
of cascades.

The detailed design of each LUT cascade is shown in Fig. 2. Here $X_i$ ($i = 1, 2, \ldots, s$)
denotes the primary inputs to the i-th cell, $d_i$ ($i = 1, 2, \ldots, s$) denotes
the data inputs to the i-th cell and provides the data value to be written in the
RAM of the i-th cell, r denotes the number of rails, where $r \le \lceil \log_2(k + 1) \rceil$,
and $c_j$ ($j = 2, 3, \ldots, s$) denotes the additional inputs to the j-th cell and is used
to select the RAM location along with $X_j$ for write access. Note that $c_j$ and
$d_i$ are represented by r bits. All RAMs except perhaps the last one have p
address lines; the last RAM has at most p address lines. When WE is high,
$c_j$ is connected to the RAM through a MUX, allowing data to be written into
Figure 1. Architecture of the multiple LUT cascade.


Table  3.  6-entry  LPM  table. 


Address   Prefix Vector
1         100000
2         10010*
3         1010**
4
5
6
Table  4.  Truth  table  for  the  corresponding  LPM  function. 


        Input                Output
x1 x2 x3 x4 x5 x6    out2 out1 out0    LUT Cascade
1  0  0  0  0  0     0    0    1       Upper (Cells 1 and 2)
1  0  0  1  0  *     0    1    0       Upper (Cells 1 and 2)
1  0  1  0  *  *     0    1    1       Upper (Cells 1 and 2)
1  0  1  1  *  *     1    0    0       Lower (Cells 3 and 4)
1  0  0  *  *  *     1    0    1       Lower (Cells 3 and 4)
1  1  *  *  *  *     1    1    0       Lower (Cells 3 and 4)


the  RAMs.  When  WE  is  low,  the  outputs  of  the  RAMs  are  connected  to  the 
inputs  of  the  succeeding  RAMs  through  a  MUX,  and  the  circuit  is  a  cascade 
that  realizes  the  LPM  function.  Note  that  the  RAMs  are  synchronous  RAMs. 
Therefore,  the  LUT  cascade  resembles  a  shift  register. 

Example  2.5  Table  3  shows  a  6-input  3-output  6-entry  LPM  table,  and  the 
truth  table  of  the  corresponding  LPM  function  is  shown  in  Table  4.  Note 
that  the  entries  in  the  two  tables  are  similar.  Table  4  is  a  compact  truth  table, 
showing  only  non-zero  outputs.  Its  input  combinations  must  be  disjoint.  Thus, 
the  two  tables  are  the  same  except  for  three  entries. 
(a) Single LUT cascade realization (b) Multiple LUT cascade realization


Figure  3.  Single  LUT  cascade  realization  and  the  multiple  LUT  cascade  realization. 


Single Memory Realization: The number of address lines is 6, and the
number of outputs is 3. Thus, the total amount of memory is $2^6 \times 3 = 192$
bits.

Single LUT Cascade Realization: Since there are k = 6 prefix vectors,
by Theorem 2.3, the number of rails is $r = \lceil \log_2(6 + 1) \rceil = 3$. Let the number
of address lines for the memory in a cell be p = 4. By partitioning the inputs
into three disjoint sets $\{x_1, x_2, x_3, x_4\}$, $\{x_5\}$, and $\{x_6\}$, we have the cascade in
Fig. 3 (a). For simplicity, only the signal lines for the cascade realization are
shown. Other lines, such as for storing data, are omitted.

The total amount of memory is $2^4 \times 3 \times 3 = 144$ bits, and the number of
levels is s = 3. Note that the single LUT cascade requires 75% of the memory
needed in the single memory realization.

Multiple LUT Cascade Realization: Partition Table 3 into two parts,
each with three prefix vectors. The number of rails in the LUT cascades
associated with each separate LPM table is $\lceil \log_2(3 + 1) \rceil = 2$. Let the number
of address lines for the memory in a cell be p = 4. By partitioning the inputs
into two disjoint sets $\{x_1, x_2, x_3, x_4\}$ and $\{x_5, x_6\}$, we obtain the realization in
Fig. 3 (b). The upper LUT cascade realizes the upper part of Table 4, while
the lower LUT cascade realizes the lower part of Table 4. The contents of
each cell are shown in Table 5.

Let $v_1$ be the output value of the upper LUT cascade, let $v_2$ be the output
value of the lower LUT cascade, and let $v_{out}$ be the output value of the special
encoder. Then, in Table 5, $(z_1, z_2)$ viewed as a standard binary number, has
value $v_1$, while $(z_3, z_4)$ viewed as a standard binary number, has value $v_2$. The
special encoder generates the LPM address from the pair of outputs, $(z_1, z_2)$
and $(z_3, z_4)$:


$out_2 = \bar{z}_1 \bar{z}_2 (z_3 \vee z_4)$,
Table 5. Truth tables for the cells in the multiple LUT cascade realization.


Cell 1 and Cell 2 (upper LUT cascade)

x1 x2 x3 x4    y1 y2    x5 x6    z1 z2    v1    v_out
1  0  0  0     0  0     0  0     0  1     1     001
1  0  0  1     0  1     0  *     1  0     2     010
1  0  1  0     1  0     *  *     1  1     3     011
Other values   1  1     *  *     0  0     0     †
               Other values      0  0     0     †

Cell 3 and Cell 4 (lower LUT cascade)

x1 x2 x3 x4    y3 y4    x5 x6    z3 z4    v2    v_out
1  0  1  1     0  0     *  *     0  1     1     100
1  0  0  *     0  1     *  *     1  0     2     101
1  1  *  *     1  0     *  *     1  1     3     110
Other values   1  1     *  *     0  0     0     †
               Other values      0  0     0     †

† depends on values from the other LUT cascade.


$out_1 = z_1 \vee \bar{z}_2 z_3 z_4$,
$out_0 = z_2 \vee \bar{z}_1 z_3 \bar{z}_4$.

Note that $(out_2, out_1, out_0)$ viewed as a standard binary number, has value $v_{out}$
corresponding to the address in Table 3. The total memory size is $2^4 \times 2 \times 4 =
128$ bits, and the number of levels is 2. Note that the multiple LUT cascade
realization requires 89% of the memory and one fewer level than the single
LUT cascade realization. □
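
The special encoder of Example 2.5 can be checked exhaustively against the general relation for $v_{out}$. The sketch below assumes that the three output expressions read out2 = z1' z2' (z3 ∨ z4), out1 = z1 ∨ z2' z3 z4, and out0 = z2 ∨ z1' z3 z4', where a prime denotes complement; these are the forms that agree with the general relation for r = 2. The function names are hypothetical.

```python
from itertools import product


def encoder_bits(z1, z2, z3, z4):
    """Assumed Boolean expressions of the special encoder (prime = complement)."""
    n1, n2, n4 = 1 - z1, 1 - z2, 1 - z4
    out2 = n1 & n2 & (z3 | z4)
    out1 = z1 | (n2 & z3 & z4)
    out0 = z2 | (n1 & z3 & n4)
    return out2, out1, out0


def encoder_value(v1, v2, r=2):
    """General relation: the upper cascade has priority; the lower one is offset."""
    if v1 != 0:
        return v1
    if v2 != 0:
        return v2 + (2**r - 1)
    return 0


mismatches = []
for z1, z2, z3, z4 in product((0, 1), repeat=4):
    out2, out1, out0 = encoder_bits(z1, z2, z3, z4)
    got = 4 * out2 + 2 * out1 + out0
    want = encoder_value(2 * z1 + z2, 2 * z3 + z4)
    if got != want:
        mismatches.append((z1, z2, z3, z4, got, want))

print("mismatches:", mismatches)
```

An empty mismatch list means that the three expressions produce, for all 16 input combinations, the address given by the general relation.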


3  Other  realizations 
3.1  Xilinx’s  TCAM 

Xilinx  (Website  of  Xilinx)  provides  a  proprietary  realization  of  a  TCAM  that 
is  produced  by  the  Xilinx  CORE  Generator  tool.  Since  a  TCAM  can  directly 
realize  an  LPM  address  generator,  we  compare  our  proposed  multiple  LUT 
cascade  realization  with  Xilinx’s  TCAM.  In  the  Xilinx  CORE  Generator  7.1i, 
we  used  the  following  parameters  to  produce  TCAMs. 

•  Implementation:  SRL16. 

•  Mode:  Standard  ternary  mode  to  generate  a  standard  ternary  CAM. 

•  Depth:  k,  the  number  of  words  or  vectors  stored  in  the  TCAM. 

•  Data  width:  n,  the  number  of  bits  in  words  or  vectors. 

•  Match Address Type: Binary encoded.

•  Address  Resolution:  Lowest. 


3.2  Registers  and  gates 

We  also  compare  our  proposed  multiple  LUT  cascade  realization  with  a  direct 
realization using registers and gates, as shown in Fig. 4. We use a register pair
(Reg. 1 and Reg. 0) to store each digit of a ternary vector. For example, if
the digit is * (don’t care), the register pair stores (1,1). Thus, for n-bit data,
we need a 2n-bit register. The comparison circuit consists of an n-input AND
gate and n 1-bit comparison circuits, each of which produces a 1 if and only if
the input bit matches the stored bit or the stored bit is don’t care (* or 11).

Figure 4. Realization of the address generator with registers and gates.

For each prefix vector of an n-input LPM address generator, we need a 2n-bit
register, n 1-bit comparison circuits, and an n-input AND gate. For an
n-input address generator with k registered prefix vectors, we need k 2n-bit
registers, nk 1-bit comparison circuits, and k n-input AND gates. In addition,
we need a priority encoder with k inputs and $\lceil \log_2(k + 1) \rceil$ outputs to generate
the LPM address. If the n-input AND gate is realized as a cascade of 2-input
AND  gates,  this  circuit  can  be  considered  as  a  special  case  of  the  multiple  LUT 
cascade  architecture,  where  r  =  1,  p  =  2,  and  g  =  k.  Note  that  the  output 
encoder  circuit  is  a  standard  priority  encoder. 
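
A behavioral sketch of the registers-and-gates realization follows. The text only fixes the code (1,1) for *; the codes (0,1) for a stored 0 and (1,0) for a stored 1 are an assumption made here so that Reg. 1 enables a match on input 1 and Reg. 0 enables a match on input 0. The table used in the example is hypothetical.

```python
# (Reg. 1, Reg. 0) code for each ternary digit. Only '*' -> (1, 1) is given
# in the text; the codes for '0' and '1' are assumed for this sketch.
ENCODE = {"0": (0, 1), "1": (1, 0), "*": (1, 1)}


def bit_match(x, reg1, reg0):
    """1-bit comparison circuit: 1 iff x matches the stored digit or it is '*'."""
    return reg1 if x == 1 else reg0


def vector_match(address, prefix):
    """n-input AND over the n 1-bit comparison circuits."""
    return all(bit_match(int(a), *ENCODE[p]) for a, p in zip(address, prefix))


def reg_gates(table, address):
    """Priority encoder: lowest matching address (1-based), 0 if none match."""
    for addr, prefix in enumerate(table, start=1):
        if vector_match(address, prefix):
            return addr
    return 0


table = ["1000", "010*", "1***"]  # hypothetical, in descending prefix length
print([reg_gates(table, a) for a in ("1000", "1011", "0101", "0011")])
```

Because the entries are stored in descending prefix length, the lowest matching address is the longest prefix match.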


4  FPGA  implementations 

We implemented the LPM address generators for 32 inputs and 504~511
registered prefix vectors on Xilinx Spartan-3 FPGAs (XC3S4000-5) in three ways:
the multiple LUT cascade, Xilinx CORE Generator 7.1i, and registers and
gates. XC3S4000-5 (Spartan-3 FPGA data sheet 2005) has 96 BRAMs and
27,648 slices. Each BRAM contains 18K bits, and each slice consists of two
4-input LUTs, two D-type flip-flops, and multiplexers. For each implementation,
we described the circuit in Verilog HDL, and then used Xilinx ISE 7.1i
to synthesize and to perform place and route.

First, we used the multiple LUT cascade to realize the LPM address generators.
To use the BRAMs in the FPGA efficiently, we chose the memory size
of a cell in the LUT cascade not to exceed the size of a BRAM unit. Let p be
the number of address lines of the memory in the cell. Since each BRAM contains
$2^{11} \times 9$ bits, we have the relation $2^p \cdot r \le 2^{11} \times 9$, where r is the number of rails.
Thus, we have $p = \lfloor \log_2(9/r) \rfloor + 11$, where $\lfloor a \rfloor$ denotes the largest integer less
than or equal to a.
Table 6. Four multiple LUT cascade realizations.


Design   Number of prefix vectors   r   p    Group   Level
r6p11    504                        6   11   8       6
r7p11    508                        7   11   4       7
r8p11    510                        8   11   2       8
r9p11    511                        9   11   1       12

r: Number of rails
p: Number of address lines of the RAM in a cell
Group: Number of LUT cascades


Figure 5. Realization of r8p11.


We designed four LPM address generators r6p11, r7p11, r8p11, and r9p11,
as shown in Table 6, where the column Number of prefix vectors denotes
the number of registered prefix vectors, the column r denotes the number of
rails, the column p denotes the number of address lines of the RAM in a cell,
the column Group denotes the number of LUT cascades, and the column
Level denotes the number of levels or cells in the LUT cascade.

To explain Table 6, consider r8p11, which is shown in Fig. 5. For r8p11, since
the number of rails is r = 8, the number of groups is $\lceil 510/(2^8 - 1) \rceil = 2$. Thus, we need
two LUT cascades. Since each LUT cascade consists of 8 cells, the number of
levels of r8p11 is 8. To efficiently use BRAMs in the FPGA, the number of
address lines of the RAM in the cell is set to $p = \lfloor \log_2(9/8) \rfloor + 11 = 11$. Let
$v_1$ be the value of the outputs of the upper LUT cascade, let $v_2$ be the value of
the outputs of the lower LUT cascade, and let $v_{out}$ be the value of the outputs
of the special encoder. Then, we have the relation:

$$ v_{out} = \begin{cases} v_2 + 255 & \text{if } v_1 = 0 \text{ and } v_2 \neq 0, \\ v_1 & \text{otherwise.} \end{cases} $$

The circuit realizing this expression requires 11 slices on the FPGA. For the
whole circuit, r8p11 requires 16 BRAMs and 69 slices. From this table, we
can see that decreasing r increases the number of groups, but decreases the
number of levels.
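
The entries of Table 6 follow from the formulas given so far, as the sketch below illustrates. The function names are hypothetical. The value p is computed in integer arithmetic as the largest p with 2^p · r ≤ 2^11 × 9, which is the same as ⌊log2(9/r)⌋ + 11; the number of levels is the bound of Theorem 2.3 taken with equality.

```python
BRAM_BITS = 2**11 * 9  # 18,432 bits per BRAM


def address_lines(r):
    # Largest p such that 2**p * r <= BRAM_BITS.
    return (BRAM_BITS // r).bit_length() - 1


def groups(k, r):
    # ceil(k / (2**r - 1)) LUT cascades.
    return -(-k // (2**r - 1))


def levels(n, p, r):
    # ceil((n - r) / (p - r)) cells per cascade.
    return -(-(n - r) // (p - r))


def configure(n, k, r):
    p = address_lines(r)
    g = groups(k, r)
    s = levels(n, p, r)
    return {"p": p, "group": g, "level": s, "cells": g * s}


for k, r in [(504, 6), (508, 7), (510, 8), (511, 9)]:
    print(f"r{r}p11", configure(32, k, r))
```

The product of groups and levels (48, 28, 16, and 12 cells) coincides with the number of BRAMs reported for the four designs in Table 7.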

Next, we used the Xilinx CORE Generator 7.1i tool to produce Xilinx’s
TCAM. Since the Xilinx CORE Generator 7.1i does not support TCAMs with
32 inputs and 505~511 registered prefix vectors, we designed a TCAM with 32
inputs and 504 registered prefix vectors. The resulting TCAM required 8,590
slices. Note that Xilinx’s TCAM requires one clock cycle to find a match.

Finally, we designed the LPM address generator with n = 32 inputs and
k = 511 registered prefix vectors using registers and gates, as shown in Fig. 4.
This design is denoted Reg-Gates. Note that the number of inputs is 32 and
the number of outputs is 9. This design required 27,646 slices.

5  Performance  and  comparisons 

In Table 7, we show the performance of multiple LUT cascade realizations (i.e.,
r6p11, r7p11, r8p11, and r9p11), and compare them with Xilinx’s TCAM and
Reg-Gates. In Table 7, the column Level denotes the number of levels or
cells  in  the  LUT  cascade,  the  column  Slice  denotes  the  number  of  occupied 
slices,  the  column  Memory  denotes  the  amount  of  memory  required,  and 
the  column  F_clk  denotes  the  maximum  clock  frequency.  The  column  tco 
denotes  the  maximum  clock-to-output  propagation  delay.  (It  is  the  maximum 
time  required  to  obtain  a  valid  output  at  the  output  pin  that  is  fed  by  a  register 
after  a  clock  signal  transition  on  an  input  pin  that  clocks  the  register).  The 
column  tpd  denotes  the  maximum  propagation  time  from  the  inputs  to  the 
outputs.  The  column  Th.  denotes  the  maximum  throughput.  Since  the  LPM 
address  generator  has  9  outputs,  it  is  calculated  as: 

Th.  =  9  •  F_clk. 

For Reg-Gates, Delay denotes the maximum delay from the input to the
output and is equal to tpd. For multiple LUT cascade realizations and Xilinx’s
TCAM, Delay denotes the total delay, and is calculated by:

$$ \mathrm{Delay} = \frac{1000 \cdot \mathrm{Level}}{\mathrm{F\_clk}} + \mathrm{tco}, $$

where 1000 is a unit conversion factor.

Consider the area occupied by the various realizations. From the Spartan-3
family architecture (Spartan-3 FPGA data sheet 2005), we can see that the
area of one BRAM is at least the area of 16 slices (a slice consists of two
“4-input LUTs”, two flip-flops, and miscellaneous multiplexers).

An alternative estimate shows that the area of one BRAM is equivalent to
that of 96 slices, as follows. In the Xilinx Virtex-II FPGA, one “4-input LUT”
occupies approximately the same area as 96 bits of BRAM (also containing
18K bits) (Sproull et al. 2005). Note that both “4-input LUTs” and BRAMs
of the Virtex-II FPGA are similar to those of the Spartan-3 FPGA. Thus,
Table 7. Comparisons of FPGA implementations of the LPM address generator.


Design          Level  Slice   Memory   F_clk    tco/tpd        Th.           Area†         Th./Area       Delay          Area-Delay
                               (BRAM)   (MHz)    (ns)           (Mbps)        (slice)       (Mbps/slice)   (ns)           (slice·µs)
r6p11           6      178     48       103.89   24.89 (tco)    935           4786          0.195          82.64          395.53
r7p11           7      116     28       113.77   23.46 (tco)    1024          2804          0.365          84.99          238.31
r8p11           8      69      16       139.93   20.91 (tco)    1259 (best)   1605          0.785          79.57          127.71
r9p11           12     99      12       139.08   13.72 (tco)    1252          1251 (best)   1.001 (best)   100.00         125.10 (best)
Xilinx’s TCAM   1      8590    -        22.52    13.48 (tco)    203           8590          0.024          57.88 (best)   497.23
Reg-Gates       -      27646   -        -        58.67 (tpd)    -             27646         -              58.67          1621.99

† We assume that the area for one BRAM is equivalent to the area of 96 slices.


we can deduce that one BRAM of the Spartan-3 FPGA occupies about the
same area as 192 (= 18 × 1024/96) “4-input LUTs”. If we view one “4-input
LUT” as approximately one-half a slice according to our discussion in the
previous paragraph, we conclude that one BRAM has about the same area as
96 (= 192/2) slices. Thus, the two estimates of the area for one BRAM, 16 and
96 slices, are quite different. For this analysis, a worst case of 96 slices/BRAM
was used.

In Table 7, the column Area denotes the equivalent utilized area, where
the area for one BRAM is equivalent to the area for 96 slices. The column
Th./Area denotes the efficiency of throughput per area for one slice. The
column Area-Delay denotes the area-delay product. The value denoted by
best shows the best result.
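
The derived columns of Table 7 can be recomputed from the measured ones. The sketch below uses the definitions in the text (Th. = 9 · F_clk, Delay = 1000 · Level / F_clk + tco, and 96 slices per BRAM); the function and key names are hypothetical, and small differences in the last digit relative to the table are rounding effects.

```python
BRAM_SLICES = 96  # worst-case area of one BRAM, in slices
OUTPUTS = 9       # output bits of the LPM address generator


def metrics(level, slices, brams, f_clk_mhz, tco_ns):
    throughput = OUTPUTS * f_clk_mhz              # Mbps
    area = slices + BRAM_SLICES * brams           # equivalent slices
    delay = 1000 * level / f_clk_mhz + tco_ns     # ns
    return {
        "throughput_mbps": throughput,
        "area_slices": area,
        "th_per_area": throughput / area,         # Mbps per slice
        "delay_ns": delay,
        "area_delay_slice_us": area * delay / 1000,
    }


r6p11 = metrics(level=6, slices=178, brams=48, f_clk_mhz=103.89, tco_ns=24.89)
r9p11 = metrics(level=12, slices=99, brams=12, f_clk_mhz=139.08, tco_ns=13.72)
for name, m in (("r6p11", r6p11), ("r9p11", r9p11)):
    print(name, {key: round(value, 3) for key, value in m.items()})
```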

Xilinx’s TCAM has the smallest delay, but requires many slices. Reg-Gates
has almost the same delay as Xilinx’s TCAM, but requires about three times
as many slices as Xilinx’s TCAM. Note that Reg-Gates requires no clock pulses
in the LPM address generation operation, while the others are sequential
circuits that require clock pulses. Since the delay of Reg-Gates is 58.67 ns, the
equivalent throughput is (1000/58.67) × 9 = 153 (Mbps), which is lower than
all others.

All multiple LUT cascade realizations have higher throughput, smaller area,
higher throughput/area, and are more efficient in terms of area-delay than
Xilinx’s TCAM. r8p11 has the highest throughput and the smallest delay among
all multiple LUT cascade realizations, but is slightly less efficient in terms
of area-delay than r9p11. r9p11 has the smallest area, the highest
throughput/area, and the highest efficiency in terms of area-delay among all
realizations. Although r8p11 has almost the same area-delay as r9p11, its area is 28%
larger than that of r9p11. Hence, r9p11 is the best multiple LUT cascade
realization since it has 5.17 times more throughput, 40.71 times more
throughput/area, and is 2.97 times more efficient in terms of area-delay product than
Xilinx’s TCAM, while the area is only 15% of Xilinx’s TCAM.
Table 8. Memory-Delay for multiple LUT cascade realizations.


Design                    r6p11    r7p11    r8p11    r9p11
Area-Delay (slice·µs)     395.53   238.31   127.71   125.10
Memory-Delay (BRAM·µs)    3.97     2.38     1.27     1.20


6  The  optimum  configuration  of  the  multiple  LUT  cascade 

First, consider the relation between the required memory size and the total
area. As can be seen from Table 7, for r6p11, which has the most complicated
encoder, the memory required occupies 96.3% of the total area. For r9p11, the
memory required occupies 92.1% of the total area. Note that r9p11 has the
smallest proportion of memory area to total area among all the multiple LUT
cascade realizations. Thus, the memory consumes no less than 92% of the total
area. In addition, as shown in Table 7, the size of Memory is approximately
proportional to the Area. Hence, we can assume that the multiple LUT cascade
realization with the smallest memory size corresponds to that with the smallest
total area. Second, consider the relation between the Area-Delay and
Memory-Delay products. As shown in Table 8, r9p11 has both the smallest
Area-Delay and the smallest Memory-Delay among all multiple LUT cascade
realizations. Note that the value of Memory-Delay is approximately proportional
to the Area-Delay. Thus, we can assume that the realization with the smallest
Memory-Delay corresponds to that with the smallest Area-Delay. Therefore, we
can use the total size of memory required instead of the total area, and
Memory-Delay instead of Area-Delay to find the optimum multiple LUT cascade
realization. Doing this allows a formal analysis, as shown in the next section.
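The proportionality claim can be checked directly against the Table 8 values. The short sketch below is only an illustration: the dictionary layout and variable names are ours, not part of the original design flow.

```python
# Values copied from Table 8: (Area-Delay in slice·us, Memory-Delay in BRAM·us).
table8 = {
    "r6p11": (395.53, 3.97),
    "r7p11": (238.31, 2.38),
    "r8p11": (127.71, 1.27),
    "r9p11": (125.10, 1.20),
}

# If the two metrics are proportional, this ratio is roughly constant.
ratios = {name: ad / md for name, (ad, md) in table8.items()}
for name, ratio in ratios.items():
    print(f"{name}: Area-Delay / Memory-Delay = {ratio:.1f}")

# Both metrics should select the same design as the minimum.
best_by_area_delay = min(table8, key=lambda d: table8[d][0])
best_by_memory_delay = min(table8, key=lambda d: table8[d][1])
print(best_by_area_delay, best_by_memory_delay)
```

The ratios all fall between about 99.6 and 104.3, and both metrics select r9p11, which is what the substitution of Memory-Delay for Area-Delay relies on.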


6.1  Total  size  of  memory 

Consider the multiple LUT cascade implementation of an n-input LPM address
generator that stores k prefix vectors. Let each cell be realized as a
reconfigurable memory with m bits. For the implementations discussed
previously in this paper, this memory is a BRAM of the Spartan-3 FPGA,
where m = 18,432 bits. Each cell in the LUT cascade has r outputs, where
$r \le \lceil \log_2(k+1) \rceil$. With m bits stored in each memory and r bits
per word, $m/r$ words are stored in each LUT cell. Therefore, the number of
address inputs for each LUT cell is $p(r) = \lfloor \log_2(m/r) \rfloor$. Note
that $r \le p(r) - 1$. Let $M(r)$ be the total memory needed to implement the
given LPM address generator. That is,

$$M(r) = m\,s\,g, \qquad (2)$$

where $s = \lceil (n-r)/(p(r)-r) \rceil$ is the number of cells in each of the
$g = \lceil k/(2^r-1) \rceil$ cascades that make up the multiple LUT cascade
realization of the LPM address generator.

Theorem 6.1 $M(r)$ is a monotone decreasing function of r for $r \le p(r) - 2$.

Since $M(r)$ is monotone decreasing for $r \le p(r) - 2$, to find the minimum
$M(r)$, it is only necessary to find $M(r)$ for $r = p(r) - 2$ and
$r = p(r) - 1$, an upper bound on r.
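The quantities of this subsection can be tabulated with a few lines of code. The following is a minimal sketch, assuming s = ⌈(n − r)/(p(r) − r)⌉ and g = ⌈k/(2^r − 1)⌉; the function names are ours and purely illustrative. It does not enforce the separate bound r ≤ ⌈log2(k + 1)⌉.

```python
def address_inputs(m_bits: int, r: int) -> int:
    """p(r) = floor(log2(m / r)), computed with integer arithmetic."""
    return (m_bits // r).bit_length() - 1


def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def cascade_shape(n: int, k: int, m_bits: int, r: int) -> tuple[int, int, int]:
    """Return (p, s, g) for an n-input generator storing k prefix vectors."""
    p = address_inputs(m_bits, r)
    if r > p - 1:
        raise ValueError(f"r={r} violates r <= p(r) - 1 (p(r)={p})")
    s = ceil_div(n - r, p - r)   # cells in each cascade
    g = ceil_div(k, 2**r - 1)    # number of cascades
    return p, s, g


def total_memory_bits(n: int, k: int, m_bits: int, r: int) -> int:
    """M(r) = m * s * g, Eq. (2)."""
    _, s, g = cascade_shape(n, k, m_bits, r)
    return m_bits * s * g


M_BITS = 18 * 1024  # one Spartan-3 BRAM, 18,432 bits
# Memory in units of BRAMs for n = 32, k = 2040 and r = 1..9.
brams_by_r = {
    r: total_memory_bits(32, 2040, M_BITS, r) // M_BITS for r in range(1, 10)
}
print(brams_by_r)
```

For n = 32 and k = 2040, the BRAM counts for r = 8 and r = 9 come out to 64 and 48, which agrees with the Memory column reported for these two designs in Table 10, and the counts decrease strictly over r = 1, ..., 9.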


6.2  Memory-delay  product 

From Table 7, we observed that the delay in an n-input LPM address generator
is given approximately as

$$D = \frac{1000}{F_{clk}}\,(s + 2), \qquad (3)$$

where $F_{clk}$ is the frequency of the clock in MHz. Let $MD(r)$ be the
memory-delay product of the multiple LUT cascade realization of the address
generator. Therefore,

$$MD(r) = (m\,s\,g)\left(\frac{1000}{F_{clk}}\,(s + 2)\right). \qquad (4)$$

Theorem 6.2 $MD(r)$ is a monotone decreasing function of r for
$r \le p(r) - 5$. In particular, when $p(r-1) = p(r)$, $MD(r)$ is a monotone
decreasing function of r for $r \le p(r) - 3$.

Since $MD(r)$ is monotone decreasing for $r \le p(r) - 5$, to find the minimum
$MD(r)$, it is only necessary to find $MD(r)$ for $r = p(r) - i$, where
i = 1, 2, ..., 5, that is, for five values of r. In particular, when
$p(r-1) = p(r)$, to find the minimum $MD(r)$, we only need to consider the
three cases $r = p(r) - i$, where i = 1, 2, and 3.
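Because m is fixed for a given device, Eq. (4) can be regrouped so that only an integer factor depends on r. The regrouping below additionally assumes that F_clk varies little with r, which is our assumption rather than a result stated above.

```latex
MD(r) = (m\,s\,g)\left(\frac{1000}{F_{clk}}\,(s+2)\right)
      = \frac{1000\,m}{F_{clk}} \cdot s\,g\,(s+2)
```

Under that assumption, minimizing MD(r) amounts to minimizing the product s · g · (s + 2) over the admissible values of r.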


6.3  Optimum multiple LUT cascade for the BRAM containing 18K
bits, 16K bits or 4K bits

In  popular  FPGAs,  such  as  Xilinx’s  FPGAs  or  Altera’s  FPGAs,  the  sizes  of 
BRAM  are  18K  bits  or  4K  bits.  For  other  FPGAs,  the  size  of  BRAM  can  be 
16K  bits.  We  consider  these  three  types  of  BRAMs  in  the  following  discussion. 

For 16K-bit BRAM, from $p(r) = \lfloor \log_2(16384/r) \rfloor$, we have
$p(r) = 10$ for $r = 9$ and $p(r) = 11$ for $5 \le r \le 8$. Since
$p(r-1) = p(r) = 11$ for $5 \le r \le 8$, from Theorem 6.2, $MD(r)$ decreases
with r when $1 \le r \le (11 - 3) = 8$. Let $\zeta(r) = s \cdot g \cdot (s+2)$,
where $s = \lceil (n-r)/(p(r)-r) \rceil$ and $g = \lceil k/(2^r-1) \rceil$.
We can verify $\zeta(8) < \zeta(9)$ when $n > 15$. In most applications, we
can assume that $n > 16$. Thus, we can conclude that $MD(r)$ is minimum when r
is maximum, where $r \le 8$.

Table 9. The value of r that makes MD(r) minimum.

Block RAM size   The minimum Memory-Delay
18K bits         $r = r_{max}$ when $r_{max} \le 8$
                 $r = r_{optimal}$ when $r_{max} = 9$
4K bits          $r = r_{max}$ when $r_{max} \le 6$
                 $r = r_{optimal}$ when $r_{max} = 7$
16K bits         $r = r_{max}$ for $r \le 8$

$r_{max}$ is the maximum integer r that satisfies both $r \le p(r) - 1$
and $r \le \lceil \log_2(k+1) \rceil$, where $p(r) = \lfloor \log_2(m/r) \rfloor$
and m denotes the size of a BRAM.

$r_{optimal}$ is the r that makes $s \cdot g \cdot (s+2)$ minimum, where
$s = \lceil (n-r)/(p(r)-r) \rceil$ and $g = \lceil k/(2^r-1) \rceil$. For
m = 18K-bit BRAM, $r_{optimal}$ can be obtained by calculating values only for
$r = p(r) - 2 = 9$ and $r = p(r) - 3 = 8$. For m = 4K-bit BRAM, $r_{optimal}$
can be obtained by calculating values only for $r = p(r) - 2 = 7$ and
$r = p(r) - 3 = 6$.

When the size of a BRAM is m = 18K bits, from
$p(r) = \lfloor \log_2(9/r) \rfloor + 11$, we have $p(r) = 11$ for
$5 \le r \le 9$. When m = 4K bits, from $p(r) = \lfloor \log_2(8/r) \rfloor + 9$,
we have $p(r) = 9$ for $5 \le r \le 8$ and $p(r) = 10$ for $r = 4$. Thus, for
both m = 18K-bit and m = 4K-bit, we have $p(r-1) = p(r)$ when
$r \le p(r) - 5$. From Theorem 6.2, $MD(r)$ is minimum when r = 11 − 2 = 9 or
r = 11 − 3 = 8 for 18K-bit BRAM, and r = 9 − 1 = 8, r = 9 − 2 = 7, or
r = 9 − 3 = 6 for 4K-bit BRAM. We can verify $\zeta(6) < \zeta(8)$ for 4K-bit
BRAM when $n > 14$. In most applications, we can assume that $n > 16$. Thus, we
only need to consider the case of $r = p(r) - 2$. Depending on the values of n
and k, $MD(r)$ is minimum when $r = p(r) - 2$ or $r = p(r) - 3$. However, for
an LPM address generator with fixed n and k, we can easily obtain an r that
minimizes $\zeta(r) = s \cdot g \cdot (s+2)$ by calculating the values for
$r = p(r) - 2$ and $r = p(r) - 3$.
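The values of p(r) quoted for the three BRAM sizes follow directly from p(r) = ⌊log2(m/r)⌋. The sketch below tabulates them; the names `BRAM_BITS` and `p_table` are illustrative only.

```python
def address_inputs(m_bits: int, r: int) -> int:
    """p(r) = floor(log2(m / r)), computed with integer arithmetic."""
    return (m_bits // r).bit_length() - 1


# BRAM capacities in bits for the three cases discussed in the text.
BRAM_BITS = {"18K": 18 * 1024, "16K": 16 * 1024, "4K": 4 * 1024}

# p(r) for r = 4..9 and each BRAM size.
p_table = {
    name: {r: address_inputs(bits, r) for r in range(4, 10)}
    for name, bits in BRAM_BITS.items()
}
for name, row in p_table.items():
    print(name, row)
```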

Table 9 shows the values of r that minimize MD(r) for three types of
BRAMs.
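For a concrete n and k, the two candidates named in the 18K-bit row of Table 9 can be evaluated as follows. This is a sketch under the definitions of r_max, s, and g given in the table notes; `r_max` and `md_factor` are hypothetical helper names, and `md_factor` returns the product s · g · (s + 2).

```python
def address_inputs(m_bits: int, r: int) -> int:
    """p(r) = floor(log2(m / r)), computed with integer arithmetic."""
    return (m_bits // r).bit_length() - 1


def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def r_max(k: int, m_bits: int) -> int:
    """Largest r with r <= p(r) - 1 and r <= ceil(log2(k + 1))."""
    limit = k.bit_length()  # equals ceil(log2(k + 1)) for k >= 1
    feasible = [
        r for r in range(1, limit + 1) if r <= address_inputs(m_bits, r) - 1
    ]
    return max(feasible)


def md_factor(n: int, k: int, m_bits: int, r: int) -> int:
    """The r-dependent factor s * g * (s + 2) of the memory-delay product."""
    p = address_inputs(m_bits, r)
    s = ceil_div(n - r, p - r)
    g = ceil_div(k, 2**r - 1)
    return s * g * (s + 2)


M_18K = 18 * 1024
top = r_max(2040, M_18K)
# Evaluate the two candidates r_max and r_max - 1.
candidates = {r: md_factor(32, 2040, M_18K, r) for r in (top, top - 1)}
print(top, candidates)
```

With n = 32 and k = 2040 this gives r_max = 9 and factors of 672 for r = 9 and 640 for r = 8, the same values used in Example 6.3.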

In Table 7, Area-Delay for r9p11 and r8p11 are nearly the same. Note that
when p = 11 and n = 32, $\zeta(p-2) = 0.30$ and $\zeta(p-3) = 0.31$ are almost
the same value.

For a BRAM containing 18K bits, Theorem 6.1 shows that the area for r = 9
is smaller than for r = 8. If MD(r) for r = 9 is almost the same as for r = 8,
then the multiple LUT cascade is optimum when r = 9. Similarly, for a 4K-bit
BRAM, if MD(r) for r = 7 and r = 6 are almost the same, then the multiple
LUT cascade is optimum when r = 7. The following example shows the design
of an optimum multiple LUT cascade.
Figure 6. Optimum realization of r9p11g4.


Example 6.3 Consider an LPM address generator with n = 32 and k = 2040
implemented on a Spartan-3 FPGA. Note that the size of a BRAM is m = 18K
bits. First, from $p(r) = \lfloor \log_2(9/r) \rfloor + 11$, we have
$p(r) = 11$ when $5 \le r \le 9$. To obtain the optimal Area-Delay realization,
from Table 9, r can be 8 or 9 when n = 32 and k = 2040. Let
$\zeta(r) = s \cdot g \cdot (s+2)$. We have $\zeta(9) = 672$ and
$\zeta(8) = 640$. Note that $\zeta(9)$ is nearly the same as $\zeta(8)$. Since
the area for r = 9 is minimum from Theorem 6.1, the multiple LUT cascade is
optimum when p = 11 and r = 9. In this case, a realization with
$g = \lceil 2040/(2^9-1) \rceil = 4$ LUT cascades is optimum. Also, the number
of levels is $s = \lceil (32-9)/(11-9) \rceil = 12$, which shows that each LUT
cascade consists of 12 cells. Finally, we need a special encoder. Let $v_1$ be
the value of the outputs of the top LUT cascade, let $v_2$ be the value of the
outputs of the second LUT cascade, let $v_3$ be the value of the outputs of the
third LUT cascade, let $v_4$ be the value of the outputs of the fourth LUT
cascade, and let $v_{out}$ be the value of the outputs of the special encoder.
Then, we have the relation:

$$v_{out} = \begin{cases}
v_4 + 1533 & \text{if } v_1 = v_2 = v_3 = 0 \text{ and } v_4 \ne 0, \\
v_3 + 1022 & \text{if } v_1 = v_2 = 0 \text{ and } v_3 \ne 0, \\
v_2 + 511  & \text{if } v_1 = 0 \text{ and } v_2 \ne 0, \\
v_1        & \text{otherwise.}
\end{cases}$$

The optimum realization of r9p11g4 is shown in Fig. 6. □
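The special encoder relation can be modeled in software as a priority selection over the cascade outputs. The sketch below is a behavioral model only, not the FPGA encoder itself; the function name and the generalization to g cascades with offset (i − 1)(2^r − 1) are our assumptions, chosen so that r = 9 reproduces the offsets 511, 1022, and 1533.

```python
from typing import Sequence


def special_encoder(cascade_outputs: Sequence[int], r: int = 9) -> int:
    """Combine the cascade outputs v1..vg into a single output value.

    The first cascade (in top-to-bottom order) with a non-zero output wins,
    and its value is shifted by (i - 1) * (2**r - 1), where i is its 1-based
    position. If every output is zero, the result is v1 = 0.
    """
    per_cascade = 2**r - 1  # 511 for r = 9
    for i, v in enumerate(cascade_outputs):  # i is 0-based here
        if v != 0:
            return v + i * per_cascade
    return 0


# Usage with four cascades, as in Example 6.3.
print(special_encoder([0, 0, 0, 5]))  # fourth cascade: 5 + 1533
print(special_encoder([0, 3, 0, 0]))  # second cascade: 3 + 511
print(special_encoder([4, 1, 1, 1]))  # top cascade has priority
```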

We also implemented the LPM address generator with n = 32 and k = 2040
on Xilinx Spartan-3 FPGAs (XC3S4000-5). Table 10 shows that r9p11g4 has
almost the same throughput and area-delay as r8p11g8, but its area is only
75% of that of r8p11g8. In addition, r9p11g4 has a higher throughput/area than
r8p11g8. Thus, r9p11g4 is the optimum realization for the LPM address
generator with n = 32 and k = 2040.

Table 10. FPGA implementations of the LPM address generator with n = 32 and k = 2040.

Design    Level  Slice  Memory  F_clk        tco    Th.     Area*    Th./Area      Delay   Area-Delay
                        (BRAM)  (MHz)        (ns)   (Mbps)  (slice)  (Mbps/slice)  (ns)    (slice·μs)
r8p11g8   8      299    64      111.20       26.00  1223    6443     0.190         97.94   631.04
r9p11g4   12     241    48      [illegible]  23.00  1225    4849     0.253         130.79  634.19

*We assume that the area for one BRAM is equivalent to the area of 96 slices.
r8p11g8 denotes the FPGA implementation with r = 8, p = 11, and g = 8.
r9p11g4 denotes the FPGA implementation with r = 9, p = 11, and g = 4.
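The derived columns of Table 10 can be recomputed from the measured ones. The sketch below assumes, per the table footnote, that one BRAM counts as 96 slices; the function and key names are illustrative, and small differences in the last digit are expected because the tabulated delays are rounded.

```python
BRAM_AREA_SLICES = 96  # from the Table 10 footnote


def derived_metrics(slices: int, brams: int, throughput_mbps: float,
                    delay_ns: float) -> dict:
    """Recompute Area, Th./Area, and Area-Delay from the measured columns."""
    area = slices + BRAM_AREA_SLICES * brams
    return {
        "area_slices": area,
        "throughput_per_area": throughput_mbps / area,    # Mbps per slice
        "area_delay_slice_us": area * delay_ns / 1000.0,  # slice·us
    }


# Measured columns copied from Table 10.
r8p11g8 = derived_metrics(slices=299, brams=64, throughput_mbps=1223, delay_ns=97.94)
r9p11g4 = derived_metrics(slices=241, brams=48, throughput_mbps=1225, delay_ns=130.79)

area_ratio = r9p11g4["area_slices"] / r8p11g8["area_slices"]
print(r8p11g8)
print(r9p11g4)
print(f"area ratio r9p11g4 / r8p11g8 = {area_ratio:.2f}")
```

The recomputed areas are 6443 and 4849 slices, and their ratio is about 0.75, consistent with the 75% figure quoted in the text.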


7  Conclusions 

In this paper, we presented the multiple LUT cascade to realize LPM address
generators. In addition, we discussed an approach to obtain the optimum
configuration of the multiple LUT cascade on FPGAs. Although we illustrated the
design method for n = 32 and k = 504 ~ 511, it can be extended to other
values of n and k.

We implemented four LPM address generators (i.e., r6p11, r7p11, r8p11, and
r9p11) on the Xilinx Spartan-3 FPGA (XC3S4000-5) by using the multiple
LUT cascade. For comparison, on the same type of FPGA, we also implemented
Xilinx’s proprietary TCAM and Reg-Gates, an approach proposed by us as a
likely solution to the LPM problem. Xilinx’s TCAM has the smallest delay, but
requires many slices. Reg-Gates has almost the same delay as Xilinx’s TCAM,
but requires the largest area, about three times as many slices as Xilinx’s
TCAM. All multiple LUT cascade realizations have higher throughput, smaller
area, higher throughput/area, and are more efficient in terms of area-delay
product than Xilinx’s TCAM.